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Under the auspices of the U. S. Navy, we are designing and im- 
plementing a multiprocessor (the S-l) with at least ten times the com- 
putational power of the Cray-1. Our first step is to develop a general- 
purpose uniprocessor with a performance level comparable to the 
Cray-1; the multiprocessor will then be made up of 16 of these uni- 
processors, sharing a main memory. The uniprocessors can be used 
together for large problems or separately for several smaller 
problems. To reduce average memory-access time, each uniprocessor 
has a private cache memory. We have also developed a powerful 
design system (SCALD) that supports extremely efficient structured 
design of digital logic. Using advanced compiler and verification 
techniques, SCALD can complete the details of a computer design 
starting from a high-level specification. 



Our S-l Project has as its 
general goal the development of 
advanced digital processing 
technology for potential application 
throughout the U.S. Navy. This 
work involves the design and im- 
plementation of extremely high- 
performance general-purpose 
computers. 

The basic goals of the S-l 
Project may be divided into 
development-oriented and 
research-oriented sets. 

The primary development- 
oriented goal is to establish 
methods for faster design and im- 
plementation of advanced digital 
processors. Our approach to this 
goal includes the development of a 
design system that supports struc- 
tured computer-aided logic design 
and the development of automated 
implementation and debugging 
techniques. 

A second development-oriented 
goal is to provide prototype im- 
plementations of highly cost- 
effective digital processors against 



which the Navy may measure com- 
mercial offerings. We approach this 
goal in three ways: by developing a 
durable and extensible uni- 
processor instruction-set architec- 
ture (the S-l Native Mode) that will 
evolve in such a manner that the 
developing software base is unaf- 
fected by changes in the underlying 
hardware, by designing a common 
underlying hardware structure for a 
class of cost-effective, high- 
performance S-l Uniprocessors, 
and by developing a multiprocessor 
architecture and implementation 
that allows the S-l Uniprocessors 
to be used in a wide variety of ap- 
plications, particularly those requir- 
ing very large computing rates or 
high operational reliabilities. 

Our primary research-oriented 
goal is to invent and evaluate in use 



the concepts and languages 
necessary to support practical, 
high-level, general-purpose digital 
logic design. A second goal is to 
provide a practical multiprocessor 
research environment, by im- 
plementing multiprocessor 
hardware with sufficient computing 
capability to solve real problems of 
interest to real users. At the same 
time, we intend to implement and 
evaluate a fundamental new mul- 
tiprocessor architecture consisting 
of a fully-connected network of in- 
dependent processors, each with a 
private, hardware-managed cache 
memory. Finally, we must invent 
and evaluate operating-system, 
language, and hardware constructs 
that will support the partitioning of 
single large problems across multi- 
ple independent processors. 

The following sections divide dis- 
cussion of the Project's work 
toward these basic goals into three 
categories: S-l Multiprocessors, 
their constituent S-l Unipro- 
cessors, and the S-l Design System 
that supports the design of these 
S-l processors. 

Multiprocessors 

A multiprocessor is a network of 
computers that concurrently ex- 
ecute a number of independent in- 
struction streams on separate data 



streams while closely sharing main 
memory. A multiprocessor design 
offers significant advantages over a 
uniprocessor design that provides 
an equivalent computation rate. 
The advantages result from the 
modularity inherent in a mul- 
tiprocessor architecture and can be 
categorized as advantages of 
reliability, economy, and size. 

The advantage of reliability has 
been validated by the very reliable 
commercial systems that handle, 
for example, banking transactions 
and computer network 
communications. 1 In a well- 
designed multiprocessor system, 
failure of a single module (for ex- 
ample, a component uniprocessor, 
a crossbar switch, or a memory 
bank) does not entail failure of the 
entire system. Indeed, the 
operating system for the S-l Mul- 
tiprocessor (called Amber) is intend- 
ed to detect such module failures 
and automatically replace the func- 
tion from the available complement 
of reserve modules. 

Advantages of economy occur 
during both the design and the 
construction phases. The design 
cost per processing element is 
reduced asymptotically to zero as 
the processing element is 
replicated. The economy during 
construction is extremely important 
for semiconductors, since the unit 
replication cost of very large scale 
integrated-circuit chips varies 
nearly inversely with the replication 



factor, except for a small additive 
base cost. 

Another economy is the poten- 
tial reduction in the time between 
the design of the system and the 
delivery of the first operational unit. 
By replicating a relatively simple 
processing element many times 
and using a regular interconnection 
network, this time lag can be made 
very small; it is virtually indepen- 
dent of the processing power of the 
total system. As a result, the 
semiconductor technology used in 
a properly designed multiprocessor 
can be much more up to date than 
the technology used in a more 
complex processing structure. One 
additional economy results from 
the freedom of the multiprocessor 
designer to choose the most cost- 
effective uniprocessor element 
structure, regardless of the process- 
ing rate of the element. 

Independent of these economic 
advantages is the advantage of size. 
Regardless of whether it is 
economical to build arbitrarily 
powerful uniprocessors, at some 
point it becomes physically impossi- 
ble (with state-of-the-art 
technology). Multiprocessors, 
however, because of their 
modularity, can have larger 
processing rates. This advantage of 
multiprocessor structures is impor- 
tant because maximal computing 
rates will be necessary for certain 
applications (numerical weather 
prediction with its real-time con- 
straints, for example) into the 
foreseeable future. 



S-l Multiprocessors 

We are developing a mul- 
tiprocessor that computes at an un- 
precedented aggregate rate on a 
wide variety of problems. Figure 1 
is an artist's conception of the 
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The S-l Mark llA Multiprocessor system as it 
might be assembled in a computer center. 
The system includes 16 S-l MarkllA Uniprocessors (the 
beige and blue booklike devices arranged in two rows of 
8 each along the sides of the room), 16 main memory banks 
(housed 4 each in the 2 blue cabinets on each side of the 
room near the middle of the rows of Uniprocessors), 2 
Crossbar Switches (the X-shaped devices in the middle of 



the room) for transferring data between the Uniprocessors 
and the main memory, and peripheral equipment at the far 
end of the room, including disk drives, tape drives, printers, 
and a control console. The main memory shown consists of 
128 million bytes but is expandable up to 16 billion bytes. 
The compact arrangement shown is not essential; the 
Uniprocessors and the memory banks may be hundreds of 
feet apart. 



system. The S-l Mark II A Mul- 
tiprocessor, to be implemented 
with second-generation S-l 
Uniprocessors, each about as 
powerful as a Cray- 1 computer, will 
have a computation rate roughly 
ten times that of the Cray-1. The 
Cray-1, in turn, has a performance 
two to four times greater than that 
of the CDC 7600 and outperforms 
all other existing computers in 
general numerical computation 
work. 

Logical structure. A typical SI 
Multiprocessor consists of 16 in- 
dependent, identical S-l Uni- 
processors. Figure 2 shows the 
logical structure of the Mark IIA 
Multiprocessor. All 16 unipro- 
cessors are connected to main 
memory through the S-l Crossbar 
Switch. Each of the 16 memory 
banks can contain up to 1 billion 
bytes of semiconductor memory. 
Input and output are handled by 
peripheral processors (for example, 
LSI-lls); as many as eight can be 
attached to each S-l Mark IIA Uni- 
processor. The Synchronization 
Box is a shared bus connected to 
each member uniprocessor; one of 
its major functions is to transmit in- 
terrupts and small data packets 
from one uniprocessor to any sub- 
set of other uniprocessors in order 
to coordinate processing streams. 
Each module in an S-l Mul- 
tiprocessor is connected to a 
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5) The logical structure of the S-l Mark IIA Multiprocessor, greatly 
-^ simplified. Only the first and last of the 16 Uniprocessors, one of 
the two high-bandwidth Crossbar Switches, and the first and last of the 16 main 
memory banks are shown. As indicated, each of the Uniprocessors commu- 
nicates with any part of the main memory through the Crossbar Switch. In the 
access pattern shown (dots at intersections of the Crossbar-Switch grid), each 
of the Uniprocessors is connected to a different memory bank. When two or 
more Uniprocessors request the same memory bank, the Crossbar Switch en- 
forces queuing rules that guarantee each contender one turn in the contested 
memory bank before any other Uniprocessor has two. Private caches with very 
fast but expensive memory components within the member Uniprocessors effec- 
tively hide the combined latency of the switch and memory system. 
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diagnostics-and-maintenance 
processor (an LSI-11) that allows 
convenient remote display-oriented 
maintenance and control of the 
multiprocessor. 

All 16 S-l Uniprocessors can ex- 
ecute independent instruction 
streams on independent data 
streams. Thus, all 16 uniprocessors 
can cooperate in the solution of a 
single large problem (for example, 
by means of a Monte Carlo-based 
algorithm, an increasingly popular 
and easily partitioned approach to 
physical simulation). The high- 
bandwidth, low-latency inter- 
processor communications 
provided by the Crossbar Switch 
facilitate the partitioning of physical 
simulation problems with little ef- 
ficiency loss, but the 16 uni- 
processors can also process com- 
pletely independent tasks, so that 
each S-l Uniprocessor might ser- 
vice a different set of users. 
Memory requests from the member 
uniprocessors are serviced by 16 
memory banks with an aggregate 
maximum capacity of 16 billion 
nine-bit bytes. Any processor can 
uniformly access all of main 
memory through the S-l Crossbar 
Switch. The programmer thus sees 
a huge, uniform address space, 
because each memory request 
from each uniprocessor is decoded 
by hardware in the Crossbar Switch 
and sent to the appropriate 
memory bank. The Crossbar 
Switch has a maximum peak 
bandwidth of more than 10 billion 
bits per second when all its 16 



channels are transferring data 
simultaneously. 

Cache memory. The design of 
the S-l Multiprocessor allows com- 
ponent uniprocessors and memory 
banks to be physically distributed 
over distances that are limited only 
by average bandwidth require- 
ments (which degrade linearly with 
increasing cable length). To reduce 
the delays in accessing main 
memory that result from long 
cables, Crossbar- Switch transaction 
time, and relatively slow (but highly 
cost-effective) memory chips, each 
member uniprocessor contains 
private cache memories. These 
caches automatically (that is, 
without guidance from the 
programmer) retain recently 
referenced data and instructions in 
a relatively small amount of 
ultrahigh-performance memory, in 
the expectation that those data will 
be referenced again soon. 
Whenever a reference to such a 
retained datum or instruction is 
made, the information is im- 
mediately delivered directly from 
the cache, thus eliminating the 
need for a main-memory trans- 
action. Although a similar efficiency 
can be realized if main memory 
contains a special high-speed area, 
such a design places on every 
programmer the burden of manag- 
ing a variety of memory systems in 
order to maximize the efficiency of 
program execution. 



The presence of caches in a mul- 
tiprocessor necessarily introduces 
problems of cache coherence; that 
is, each uniprocessor must be able 
to read or write data in the other 
caches without any observable 
inconsistencies. 2 Without a guaran- 
tee of cache coherence, program- 
ming of certain problems in a 
cache-based multiprocessor would 
be inconceivably difficult. 

The caches of the member uni- 
processors of S-l Multiprocessors 
are private in the sense that there 
are no special communication 
paths connecting the caches of one 
uniprocessor with the caches of any 
other uniprocessor; the cache 
coherence problem is therefore es- 
pecially challenging. To solve it, the 
S-l Multiprocessor includes a 
design closely related to one in- 
dependently proposed in Ref. 2. A 
small tag is associated with each 
16-word line in physical memory. 
This tag identifies the only member 
uniprocessor (if any) that has been 
granted permission to retain (that 
is, owns) the line with write access 
and all the processors that own the 
line with read access. The memory 
controller allows multiple 
processors to own a line with read 
access but responds with a special 



error flag when a request is 
received to grant read or write 
access for any line that is already 
owned with write access or to grant 
write access for any line that is 
already owned with read access. 
Any uniprocessor receiving such an 
access-denial response is responsi- 
ble for requesting (through a sim- 
ple interrupt mechanism) that 
other uniprocessors flush the con- 
tested line from their private 
caches. This procedure maintains 
cache coherence dynamically, and 
hence extremely efficiently, without 
requiring any effort by the 
programmer. 

Error detection and correction. 
For reliability, all single-bit errors 
that occur in memory transactions 
are automatically corrected, and all 
double-bit errors are detected, 
regardless of whether the errors oc- 
cur in the switch or in the memory 
system. For protection against 
single-point failures, the S-l Mul- 
tiprocessor allows permanent con- 
nection of multiple Crossbar 
Switches that can be selected elec- 
tronically; the S-l Multiprocessor 
can thus continue operating in the 
event of a single-switch failure. 
Furthermore, the Crossbar Switch 
can be configured electronically to 
keep a backup copy of every datum 
in memory, so that failure of any 



memory bank will not entail loss of 
crucial data. Each input-output 
peripheral processor may be con- 
nected to input -output ports on at 
least two uniprocessors, so failure 
of a single uniprocessor does not 
isolate any input-output device 
from the multiprocessor system. To 
make maintenance easier, each 
member uniprocessor, each 
crossbar switch, and each memory 
bank is connected to a diagnostic 
computer that can probe, report, 
and change the internal state of all 
modules that it monitors, with very 
high time and logic resolution. 

S-l Uniprocessors 

We are developing a line of S-l 
Uniprocessors to serve as the com- 
putational nodes in the S-l Mul- 
tiprocessor. The first-generation 
S-l Uniprocessor (Mark I) has been 
implemented and evaluated in 
use, 3 the second-generation 
(Mark IIA) machine is under way, 
and future generations (Mark III, 
Mark IV, and Mark V) have been 
planned in varying amounts of 
detail. These generations of S-l 
Uniprocessors vary greatly in per- 
formance because of generation- 
to-generation advances in 
microcode, hardware structure, 
and implementation technology. 
However, all of them can conform 
to an identical instruction-set 
architecture, thereby making 
software transportable from uni- 
processors of any earlier genera- 
tion to those of any later one. 

Instruction-set architecture. 
The instruction-set architecture of a 
computer consists of those 



principles of its operation that a 
programmer without a stopwatch is 
capable of observing; that is, it in- 
cludes no timing information. The 
complete hardware and microcode 
structure that executes an 
instruction-set architecture is called 
the implementation. The im- 
plementation of the S-l Mark IIA 
Uniprocessor has been designed to 
allow high-speed emulation of 
several existing instruction-set 
architectures, including the DEC- 
10 and Univac AN/UYK-7, in addi- 
tion to the S-l instruction-set 
architecture (S-l Native Mode). 

It was apparent early in the S-l 
Multiprocessor design that no ex- 
isting instruction-set architecture 
was suitable to serve as the S-l 
Native Mode. Because then- 
existing instruction-set architectures 
had been designed under very dif- 
ferent technology constraints than 
those expected to apply to S-l 
systems, they variously suffered 
from address-space inadequacy, in- 
sufficient operations-code space, 
insufficient multiprocessing- 
oriented features, or adverse im- 
plications for high-performance im- 
plementations. 

In response to this situation, we 
developed the S-l Native Mode, 
which is probably the most widely 
reviewed high-performance com- 
puter architecture ever developed. 
Unlike the instruction-set architec- 
tures of previous high-performance 
computers (for example, the CDC 
STAR-100 or the Cray-1), which 
were developed by a few designers 
working behind corporate 
proprietary screens and were then 
frozen, the S-l Native Mode has 
been analyzed, criticized, and 
revised by scores of computer 
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scientists, engineers, and applica- 
tion specialists in industry, 
academia, and Government 
throughout the country. It has 
evolved over a period of three 
years, during design, implementa- 
tion, and operational evaluation of 
the S-l Mark I Uniprocessor 
prototype and during design of the 
S-l MarkllA Uniprocessor. 

As a consequence of this un- 
precedentedly extensive peer 
review, the S-l Native Mode is well 
developed — it contains a large, 
consistent set of features; it is highly 
extensible — it can easily include 
new features; it is general pur- 
pose — it contains features for com- 
piler and operating system ef- 
ficiency as well as for arithmetic- 
intensive and real-time applica- 
tions; and it is carefully tuned — it 
facilitates high-performance im- 
plementations of S-l Unipro- 
cessors and S-l Multiprocessors. 

The S-l Native Mode allows the 
programmer to address uniformly, 
without using base registers, 
2 billion nine-bit bytes of main 
memory, 288 times more memory 
than the Cray-1 (although relatively 
low-performance machines with 
large address spaces have recently 
appeared on the market). Indeed, it 
was primarily to provide for ade- 
quate address space that a 36-bit 
work length was adopted for the 
S-l Native Mode. 

Huge memories are crucial for 
efficient solution of large problems, 
such as three-dimensional physical 
simulations and Monte Carlo- 
intensive studies, which are of great 
current interest in a wide variety of 
applications that range from in- 
compressible fluid flow studies to 



acoustic ray tracing in highly 
stratified media. The large memory 
addressability of the S-l Native 
Mode essentially eliminates the 
programming costs associated with 
managing multiple types of com- 
puter system storage (for example, 
the SCM, LCM, drum, and disk 
memory hierarchy of the 
CDC 7600, to whose efficient 
management major portions of the 
careers of some programmers have 
been addressed). Memory 
technology has advanced so far 
since the development of small- 
address-space architectures such as 
the CDC 7600, the DEC PDP-10, 
and the IBM/360 that the current 
production cost to the S-l Project 
of a 2-billion-byte main memory us- 
ing 16K-bit memory chips is less 
than $10 million; its long-term rate 
of advance is so rapid that this cost 
can confidently be expected to 
decline by almost a factor of 2 each 
year for the next several years. 

Most software produced for S-l 
systems will be written in high-level 
compiled languages such as the 
developing DOD standard 
language, Ada. For ease of com- 
piler writing and for rapid, efficient 
execution of the compiled code, 
these languages require certain 
features in the underlying 
instruction-set architecture. S-l 
Native Mode is compiler-oriented: 
it is designed to support high-level 
languages in general, not one high- 
level language in particular, and it 
includes the full set of operators 



and addressing modes necessary 
for a simple compiler to produce 
efficient code. For example, S-l 
Native Mode supports expression 
evaluation with a unique type of 
2.5-operand instruction that allows 
the compilation of almost all forms 
of arithmetic expressions without 
using any move instructions 
(instructions which simply move 
data from one location to another 
without performing logical or 
arithmetic operations on them). 

The extent of the compiler- 
orientation of an instruction-set 
architecture can be roughly 
measured by counting the number 
of instructions necessary to repre- 
sent typical high-level language 
programs. We have observed that 
the CDC 7600 requires between 
two and three times as many in- 
structions to represent FORTRAN 
programs as does the S-l Native 
Mode; the bulk of additional 
CDC 7600 instructions are used in 
addressing computations. An ex- 
periment involving seven graduate- 
student programmers in the Com- 
puter Science Department at the 
University of California, Berkeley, 
showed that careful hand-coding of 
the PDP-11 requires an average of 
1.5 times as many instructions to 



represent a variety of high-level 
language programs as does the ST 
Native Mode. These and related 
considerations lead us to assert that 
no high-performance machine 
available today has a more 
compiler-oriented instruction-set 
architecture than the S-l Native 
Mode. 

The S-l Native Mode contains 
unprecedentedly comprehensive 
floating-point semantics. Floating- 
point numbers can be 18, 36, or 72 
bits long, using 5, 9, and 15 bits, 
respectively, to represent the expo- 
nent of 2, and 13, 27, and 57 bits, 
respectively, to represent the 
signed fraction. The largest format 
is upwards compatible with the 
floating-point format of the Cray-1. 
The 36-bit format was designed to 
be the workhorse for virtually all 
numerical applications. The 18-bit 
floating-point format was specially 
designed to support real-time 
signal processing at many hundred 
million floating-point operations 
per second, but it can be highly 
useful in any relatively low- 
precision application where 
processing speed is at a premium 
(as, for example, in Monte Carlo 
procedures). 

Compared to conventional 
floating-point representations, S-l 
floating-point formats offer one ex- 
tra bit of precision because the 
high-order bit of the fraction is 



determined from the sign and is 
not explicitly represented. The S-l 
Native Mode also allows floating- 
point operations to be correctly 
rounded in any of several different 
rounding modes. For example, 
stable rounding minimizes expect- 
ed error, and diminished- 
magnitude, augmented-magnitude, 
floor, and ceiling roundings can be 
used to measure the actual error 
developed. The S-l Native Mode 
includes special floating-point sym- 
bols (not-a-number, infinities, and 
epsilons) which allow programs to 
be created and exercised that will 
not malfunction because of tran- 
sient generation of quantities so 
large or small that they cannot be 
represented as ordinary numbers in 
the computer. A computer 
arithmetic system containing such 
symbols is essential for efficient use 
of human resources in developing 
and using robust computer 
programs.* 

Pipelining of instructions. 
Pipelining is exemplified by an 
automobile production line, in 
which a number of automobiles are 
in production simultaneously, each 
in a different stage of completion; 
the time between completion of 
construction of one automobile 
and the next is roughly the delay of 
a stage in the assembly line, rather 
than the time required for a single 
car to pass through the entire line. 
A stream of instructions in a 
pipelined computer implementa- 
tion is processed in a very similar 
fashion. 

The S-l Native Mode was 
designed especially to facilitate 



pipelined parallelism in the fetching 
and decoding of instructions, the 
associated fetching of instruction 
operands, and the eventual execu- 
tion of instructions. Pipelined 
parallelism is a conceptually simple 
type of parallelism that can result in 
extremely high computer perfor- 
mance levels. In general, designers 
of advanced instruction-set 
architectures for commercial com- 
puters have given little considera- 
tion to the implications of extensive 
pipelining, because they have 
developed those architectures with 
medium- or low-performance im- 
plementations in mind. Further- 
more, pipelining has thus far been 
used in modem computers 
primarily in the execution of in- 
structions, where it appears in the 
streaming of vectors of operands 
through pipelined arithmetic or 
logical operation functional units. 
S-l Uniprocessors pipeline the 
preparation and execution of in- 
structions that specify both scalar 
and vector operations. Every in- 
struction proceeds through multi- 
ple pipeline stages, including in- 
struction preparation, operand 
preparation, and execution. Some 
stages of the pipeline, particularly 
those dealing with operand address 
arithmetic and instruction execu- 
tion, necessarily have a wide variety 
of functions, since the pipeline 
must process a wide range of in- 
structions. This variability in opera- 
tion is effected through the exten- 
sive use of microcode, an 
architecture-defining, very low-level 
program that precisely specifies the 
operation of every pipeline stage. 
The variability built into the 
microcode-controlled pipeline 
facilitates high-performance 
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emulation of other computers (for 
example, the Navy's Univac 
AN/UYK-7). The S-l Mark I and 
Mark IIA Uniprocessors are the first 
high-performance machines to in- 
corporate instruction-preparation 
pipelines fully controlled by writable 
microcode. 

Structure and performance. 
Figure 3 shows the internal logical 
structure of the S-l Mark IIA Uni- 
processor. The machine consists of 
five microengines (extremely fast, 
relatively special-purpose program- 
mable controllers) operating in 



parallel to provide high perfor- 
mance. Four of the microengines 
form the instruction pipeline, con- 
sisting of the instruction-fetch, 
instruction -decode, operand- 
preparation, and arithmetic seg- 
ments. Some segments are inter- 
nally pipelined (a level of detail not 
shown in Fig. 3). A single 
microengine handles memory traf- 
fic in parallel with the operation of 
the instruction pipeline. A one- 
processor system can be con- 
figured by connecting an S-l 
Mark IIA Uniprocessor directly to a 
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memory controller; this requires 
neither hardware nor microcode 
changes. 

During the design of the S-l 
Mark I and Mark IIA pipelines, we 
made significant advances in com- 
puter technology. The Mark I in- 
troduced a new, simple branch- 
prediction strategy to predict the 
outcome of each test-and-branch 
operation in an instruction stream 
before its execution, thereby allow- 
ing subsequent instructions to be 
prepared without disruption. The 
Mark I also refined the use of dual 
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<3) The internal logical structure of the S-l 
\o cO) Mark IIA Uniprocessor. It consists of five dis- 
tinct microengines that operate in parallel. Four of the 
microengines control the instruction pipeline, consisting of 
instruction-fetch, instruction-decode, operand-preparation, 



and arithmetic segments, each of which in turn is pipelined. 
One microengine controls memory transactions. All five 
microengines are controlled with writable microcode. The 
Uniprocessor also contains private cache memory. 



cache memories (one for instruc- 
tions, one for data) to increase total 
cache bandwidth. The Mark IIA 
allows advance computation of 
simple operations in early pipeline 
stages; this technique minimizes idl- 
ing of pipeline stages because a 
computation (particularly, an 
operand-address computation) de- 
pends on some previous result. 
The Mark IIA includes refined con- 
trol mechanisms to coordinate the 
operation of multiple pipeline 
stages controlled by the indepen- 
dent programmable microengines. 

The S- 1 Mark IIA also employs 
vector operations to achieve high 
performance. Vector operations 
use multiple functional units in the 
pipelined arithmetic module, to 
achieve a peak computation rate 
on the S-l Mark IIA Uniprocessor 
of 400 million floating-point opera- 
tions per second. Any fatal error 
encountered during a vector opera- 
tion results in a precise interrupt, so 
the exact location of the error can 
be determined by the error- 
handling routine; this feature is 
regrettably rare on existing high- 
performance vector processors. 

Status and plans. The S-l 
Mark I was developed to be a 
prototype for evaluating the S-l 
Native Mode and its advanced 
hardware and to provide the 
necessary computational resources 



for the development of the S-l 
Mark IIA hardware and software. 
Only one S-l Mark I has been 
produced; it began operating late in 
1977. Constituted of 5350 ECL- 
10K integrated circuits, it was 
designed to execute floating-point 
arithmetic only in microcode 
emulation and also contained a 
severely reduced instruction- 
preparation pipeline. On a small set 
of floating-point-intensive scientific 
benchmark codes written in Pascal, 
the S-l Mark I has been observed 
to compute between 0.3 and 0.5 
times as fast as the CDC 7600, 
although the judicious use of hand- 
coded routines in crucial inner 
loops on the CDC 7600 was found 
to increase that machine's overall 
performance relative to the Pascal- 
programmed Mark I to a speed ad- 
vantage of fivefold, for our 16 000- 
line physical simulation code. Con- 
versely, the maximum execution 
rate of the Mark I (10 million in- 
structions per second), when com- 
bined with the powerful addressing 
modes and field-manipulating 
features of the S-l Native Mode 
instruction-set architecture, permits 
it to execute a variety of non- 
floating-point-intensive codes 
significantly more rapidly than does 
the CDC 7600. 

The second generation S-l Uni- 
processor, the Mark IIA executes 
the same instruction set (the S-l 
Native Mode) as the Mark I, but it 
has extensive hardware floating- 
point and vector operation 
capabilities. Its performance is ex- 
pected to be comparable to that of 



the Cray-1 on scientific problems 
expressed in high-level languages 
such as Ada, Pascal, and FOR- 
TRAN, for just those applications in 
which the single-word floating- 
point format of the S-l architecture 
is as useful as the substantially 
higher precision floating-point for- 
mat of the Cray-1. The Cray-1 will 
assuredly retain primacy in high- 
precision, vector-intensive data 
processing relative to near- term S-l 
Uniprocessors, since this type of 
computing capability cannot be 
justified for present or readily fore- 
seen Navy applications, most of 
which stress relatively low precision, 
very high throughput data process- 
ing. 

Table 1 shows the performance 
of the S-l Mark IIA Uniprocessor 
compared to the CDC 7600 and 
the Cray-1 on several important 
DOE benchmark miniprograms. 
These miniprograms are represen- 
tative of the full set used at LLL to 
compare the performance of ad- 
vanced scientific computers; they 
accurately and concisely charac- 
terize the computation-intensive 
portions of extensive scientific code 
at LLL. The S-l Mark IIA Uni- 
processor computes these 
benchmarks at roughly the same 
speed as the Cray-1 and almost 
twice as fast as the CDC 7600. The 
CDC 7600 rate was measured us- 
ing an optimizing compiler first 
available in 1974. The Cray-1 rates 
are based on actual performance 
measurements made in February, 
1979, with a moderately mature 
optimizing and vectorizing compiler 
supplied by Cray Research, Inc. 
Although the Cray-1 executes 
more instructions per second than 
the Mark IIA, many Cray-1 instruc- 
tions are expended in overhead 
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computations. The S-l results 
assume the use of 36-bit floating- 
point numbers, since high-precision 
arithmetic is often not necessary in 
LLL applications; however, neither 
the CDC 7600 nor the Cray-1 
provides a low-precision floating- 
point format. For applications re- 
quiring high precision, the Mark IIA 
supports operations on the 72-bit 
floating-point format at roughly half 
the speed of operations on the 36- 
bit floating-point fomnat. For low- 
precision applications, the Mark IIA 
supports operations on the 18-bit 
floating-point format at approx- 
imately twice the speed of 



operations on the 36-bit floating- 
point format. 

The S-l Mark IIA Uniprocessor 
is constituted of ECL- 100K MSI cir- 
cuits in performance-critical areas 
and ECL- 1 OK circuits elsewhere. 
All Mark IIA circuits are standard, 
commercially available products. 
The transistor population of the 
Mark IIA's arithmetic unit alone is 
greater than that of the entire cen- 
tral processing unit of the Cray-1; 
gate circuit densities within this 
arithmetic unit are about 20 times 
greater than those in the Cray-1 
central processing unit. The 
Mark IIA is in development at the 



present time; it is being packaged in 
the folded form shown in Fig. 4 
(shown unfolded on the cover). 

The S-l Mark III is in an early 
design phase. Like the Mark I and 
Mark IIA Uniprocessors, the 
Mark III executes the S-l Native 
Mode, but it is to be implemented 
completely in commercially 
available ECL-100K LSI circuits. 
While it will not achieve a large per- 
formance gain over the Mark IIA, it 
will be physically more compact 
because of its order-of-magnitude 
greater logic gate density. 

We are moving as rapidly as 
possible toward using the 
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technology of very large scale in- 
tegrated circuits. The first genera- 
tion presently planned to follow the 
Mark III will express the entire 
Mark III architecture on several 
VLSI chips, at a performance level 
at least as great as that of the 
Mark III. 

S-l Design System 

The capabilities offered by 
semiconductor technology for the 
implementation of advanced com- 
puter designs are rapidly outpacing 
the capabilities developed for ar- 
ticulating the conception of those 
designs. To make best use of 
rapidly improving semiconductor 
technology, we have developed the 
SCALD (Structured Computer- 
Aided Logic Design) System. 6 

SCALD is a graphics-based 
system for designing digital logic. It 
inputs a high-level description of a 
digital system and outputs 
magnetic tapes that are used by 
commercial automatic wire-wrap 
machines to build the hardware. 

The main advantage of using 
SCALD is a drastic reduction in the 
amount of time required to design 
a large digital system. This reduc- 
tion occurs because the designer 
can express his design in the same 



general level in which he thinks 
about it, freeing him from the task 
of actually drawing out all of the 
details of the logic and creating a 
wire list specifying its interconnec- 
tion. Designs expressed in this high- 
level notation become much more 
comprehensible for all those who 



have to work with them — for com- 
puters, for computer designers, and 
for maintenance engineers. By 
reducing the amount of clerical 
work required of digital logic 
designers, SCALD reduces the 
number of designers required to 
execute a design project and the 




( -~ £\ The S-l Mark llA Uniprocessor. The package consists of iden- 
S5° " tical pages. The pages unfold to expose all wire-wrap pins for 
maintenance. Ambient air blows up through the centers of the pages to cool the 
integrated circuits, which are mounted on the inside. Commercially available 
power supplies are mounted in the cabinet base. 
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communication overhead per 
designer, thus increasing each 
designer's productivity and further 
reducing the total designer require- 
ments of the project. Manpower 
savings well in excess of an order 
of magnitude may be realized; such 
savings have actually been 
demonstrated in practice during 
both the S-l Mark I and Mark IIA 
design efforts. 

SCALD allows designs to be 
recompiled rapidly when new in- 
tegrated circuits become available; 
such circuits may simply take the 
place of low-level modules. Thus, a 
designer can quite effectively use a 
previous design to reduce his 
design time on a new project, 
thereby taking maximum advan- 
tage of the exponential rates of ad- 
vance in component density and 
cost-effectiveness currently charac- 
terizing the semiconductor in- 
dustry. In practice, considerable 
work may still be required to up- 
date a design to incorporate recent 
technology advances, but the re- 
quired effort is likely to be much 
less than if the design were not ex- 
pressed hierarchically. 

SCALD also facilitates designing 
with very high accuracy, because 
SCALD performs design verifica- 
tion procedures that cannot be 
done by a human. Not only can 
SCALD verify syntactic details of 
the design (for example, that every 
gate input is connected to some 
output), but it can also verify that 
transmission lines are effectively 



free from signal reflections; it can 
certify that the logic networks 
defined by the designer do not con- 
tain timing errors, and it can 
demonstrate by simulation that the 
logical operation of the design is 
correct. 

Historically, logic design has 
lagged far behind program design 
in terms of the ideals of structured 
design: that arbitrary modules be 
specified, each in terms of a few 
other modules, relatively indepen- 
dently, and that they communicate 
through well-defined interfaces. 
Logic is still typically hand-drawn by 
draftsmen; the specification 
language consists of drawings of 
the primitive logical elements 
available from integrated-circuit 
manufacturers and the physical 
connections between those logical 
elements. On the other hand, 
typical modern programming 
systems readily support the design 
of arbitrary modules (that is, 
routines), each in terms of a few 
other routines, and allow the 
specification of tightly structured in- 
terfaces between those routines. 
SCALD simply expresses these 
performance-proven software- 
engineering concepts in the world 
of hardware design. 

SCALD consists of a set of com- 
puter programs. The Graphics 



Editor 7 enters drawings directly into 
a suitable computer, the Macro Ex- 
pander compiles them, and the 
Router embeds them in a physical 
packaging system. 

The Graphics Editor allows the 
designer to edit drawings at a 
graphics terminal and to print them 
out. The designer may create a 
library of shapes (macro bodies) 
that are generally abstractions of 
digital logic functions, though some 
may represent physical parts 
available from manufacturers. Each 
macro body is linked by name to a 
set of drawings, its macro defini- 
tion. A macro is defined only once 
but may be used in the drawings 
any number of times. The designed 
system is then made up by con- 
necting these macro bodies by lines 
indicating information flow. A 
single line in a drawing represents 
one or more signals (a signal vec- 
tor) and may be named. Macro 
bodies have parameters, including 
parameter signal vectors. Names 
on signal vectors include a timing 
notation that allows SCALD to 
verify automatically (using real or 
estimated delays of wires and in- 
tegrated circuits) that stated timing 
constraints will actually be satisfied 
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by the digital logic when implemen- 
ted in the specified physical 
package. 

Figure 5 shows a sample mid- 
level drawing from the Mark IIA 
design; it represents several thou- 
sands of integrated circuits. The 
drawing shows the Mark IIA data 
cache and register file, operand 
queues, alignment network, 
arithmetic module, and connec- 
tions between those elements. This 
drawing represents the described 
portion of the machine accurately, 



in that hardware is automatically 
built using the drawings as a 
specification, but it is lacking in 
detail and requires definitions of its 
submodules for completeness. 

The Macro Expander expands 
the design to individual integrated 
circuits by iteratively substituting 
the appropriate macro definition 
for each macro body in the 
drawings. The Macro Expander 
also verifies that designer-specified 
timing constraints are satisfied. The 
Macro Expander is largely 



technology-independent and is 
coded in transportable Pascal. 

The Router reads an intercon- 
nection list produced by the Macro 
Expander and produces magnetic 
tapes that permit the design to be 
implemented by automatic and 
semiautomatic commercial 
machines. Extensive maintenance 
and debugging documentation is 
produced by the Router, which is 
also coded in transportable Pascal. 

SCALD was used to design the 
S-l Mark I and the S-l Mark IIA. 
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, (S A typical SCALD drawing from a middle 
5° ^ level of the S-l Mark IIA Uniprocessor 
design. This whole drawing defines a single block in a more 
general diagram at the next higher level. Each of the blocks 
in the drawing shown here is defined by a more detailed 
drawing at the next lower level. Thus a hierarchy is es- 
tablished that stretches from the most general abstraction 



down to the individual components of the computer. Many of 
these drawings can be used over and over again in the 
design; they are drawn once and then simply recalled by 
SCALD as needed. SCALD also generates a wiring list, 
checks the design for mistakes and timing errors, and 
produces taped instructions for automatic wiring machines. 
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The Mark I design consisted of 211 
high-level drawings (drawings used 
only once in the design) and 144 
low-level drawings (drawings used 
several times). Low-level drawings 
form an investment in the par- 
ticular technology chosen for im- 
plementation, since they have a 
high probability of being used again 
in subsequent designs. In contrast, 
high-level drawings represent an in- 
vestment in the particular architec- 
ture being implemented and may 
be reused to recompile that 
architecture periodically into 
current, more cost-effective im- 
plementations. A total of two man- 
years was expended in the Mark I 
design work during an 
elapsed period of one calendar 
year. 

Structured logic design consists 
of extending to logic design the es- 
sential power of the concepts and 
tools developed for simplifying the 
programming task; the savings in 
human labor expended in design- 
ing digital systems are potentially as 
great as those resulting from the 
use of compilers. Our experience 
has shown that SCALD has made 
the S-l Mark I and Mark IIA 
designs more understandable, thus 
reducing the design efforts, enhan- 
cing design correctness, and 
facilitating generation of final 
documentation. The designs them- 
selves serve as major portions of 
the final documentation because 
they are so readily understood; 
thus, the need for expensive and 
usually inaccurate post facto 
documentation has been greatly 
reduced. Furthermore, SCALD has 
increased the mutability of these 
designs; since macros are in- 
herently isolated, changes in one 
macro definition usually require 
minimal changes in other parts of 



the design. Finally, the imposition 
of structure on the design and the 
use of computational resources in 
the verification task has resulted in 
designs of an unprecedented level 
of accuracy. 

Summary 

S-l Project effort related to the 
development of high-performance 
computing machines is directed 
toward three major areas: the S-l 
Multiprocessor, the S-l Unipro- 
cessor, and the S-l Design System. 
S-l Multiprocessors are rapidly ex- 
tensible to very high powers and 
large memory capacities at un- 
iprocessor cost-effectiveness levels 
and feature ultrareliable system 
performance. The S-l Uni- 
processors are general-purpose, 
emulation-oriented machines that 
are powerful and highly cost- 
effective and have advanced main- 
tainability features. The S-l Design 
System supports highly automated, 
general-purpose digital systems 
design and provides extensive con- 
struction and debug support of ad- 
vanced computer systems. 
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